Assignment 4: Classification and Clustering

1. Overview

This assignment is about classification and clustering. We look at a dataset assembled from census data and examine how the crime rate varies across multiple correlated variables.

The methods used are linear discriminant analysis (LDA) and k-means clustering.

2. The dataset

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
data("Boston")

Dataset structure

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 rows and 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs of Boston. From the paper cited on the stat.ethz.ch page, Harrison, D. and Rubinfeld, D.L. (1978) “Hedonic housing prices and the demand for clean air”, we find that each row represents an “SMSA census tract”, i.e. a census area in Boston containing some number of housing units.

The columns contain social statistics related to these census areas (e.g. crim = crime rate, ptratio = pupil-teacher ratio), data about the housing units in the area (rm = avg # of rooms per unit, medv = median housing unit value, age = prop. houses built before 1940), and data about the location of the area (e.g. dis = weighted mean of distances to employment centers, chas = 1 if by Charles River, 0 if not by Charles River).

Some of the columns are a little counter-intuitive or difficult to interpret. E.g., the column age is the proportion of houses built before 1940, and the column lstat is the proportion of the population that is lower status. From the Harrison & Rubinfeld paper, lower status means: “1/2 * (proportion of adults without some high school education and proportion of male workers classified as laborers)”.


Before we move forward, there is one small issue to fix: let’s change chas from an integer to a logical.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Summary

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
##       crim                zn             indus          chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Mode :logical  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   FALSE:471      
##  Median : 0.25651   Median :  0.00   Median : 9.69   TRUE :35       
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Looking at this summary, the columns are definitely not all created equal. I am mostly looking at the difference between the mean and the median, which, if they differ by much, can indicate that a variable is not normally distributed. Some of the columns are close to normal distributions, but e.g. zn has a median of 0 and a mean of 11.36. Other highly skewed columns are rad, crim, tax, chas, and black.
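As a quick numerical sketch of this (my own addition, not part of the assignment output), we can rank the columns by how far the mean sits from the median in units of the standard deviation, a rough proxy for skewness:

```r
library(MASS)
data("Boston")

# Rough skewness proxy: distance between mean and median in SD units.
# Larger absolute values suggest a more skewed (less normal) column.
skew_proxy <- sapply(Boston, function(x) (mean(x) - median(x)) / sd(x))
round(sort(abs(skew_proxy), decreasing = TRUE), 2)
```

Columns like rad, zn, tax, and crim should land near the top of this ranking, matching the reading of the summary above.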

3. Graphical summary

With ggpairs

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
Boston_explore %>%
  ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix

Ok, lots to unpack here. Let’s start with a visual check of each variable’s distribution (the diagonal). Almost none of the columns look normally distributed, with the possible exception of rm, the number of rooms.

There are lots of interesting correlations. Just looking at the scatter plots, the three variables rm, medv, and lstat seem to have strikingly strong pairwise relationships, which makes sense to me.

The rad variable, an “index of accessibility to radial highways”, clearly has a bimodal, or one could even say split, distribution. A subset of areas has a much higher index than the rest, and in the scatter plots this higher-index population forms a clearly visible line that consistently covers a different range of the other variable than the lower-index population does. The effect is most noticeable in the crim, tax, nox, dis and black scatter plots.

dis and nox also have a strikingly logarithmic-looking relationship.

In general, nearly every variable seems to be correlated with every other variable, excepting the chas (area is by the Charles river) column.

With corrplot

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot

Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.

We see strong correlations (large circles) between:

  • dis and (zn, indus, nox, and age)
  • tax and rad; this is a very strong correlation, and the two seem to capture much of the same variation
  • tax and (crim, indus, and nox), and likewise for rad
  • lstat and medv

4. Standardize and summarize

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

Let’s run the scale function on the dataset.

Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We’ve now standardized the columns by subtracting the mean and dividing by the standard deviation, so that if a column were normally distributed, it would now be standard normally distributed.
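As a sanity check (a sketch of my own, not required by the assignment), scale() should be equivalent to manually subtracting the column mean and dividing by the sample standard deviation:

```r
library(MASS)
data("Boston")

# scale() centers on the column mean and divides by sd() (the n - 1
# sample standard deviation), so a manual z-score should match exactly.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto <- as.data.frame(scale(Boston))$crim
all.equal(manual, auto)  # TRUE
```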

Create a categorical crime rate column

# Create a quartile class from the scaled crime rate
Boston_scaled$crime <- cut(Boston_scaled$crim,
                           breaks = quantile(Boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))

# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

We’ve turned the crime rate column into a categorical variable indicating which quartile of the crime rate distribution the census area falls into.

Create training and test datasets

set.seed(179693716) 
n <- nrow(Boston_scaled)
split <- sample(n, size = floor(n * 0.8))

train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]

Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.

5. Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2475248 0.2351485 0.2623762 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.85485581 -0.8956178 -0.08120770 -0.8436157  0.4413548 -0.8603410
## med_low  -0.09212737 -0.2763238  0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309  0.1791469  0.26643202  0.3710782  0.1054813  0.4474274
## high     -0.48724019  1.0170298 -0.04947434  1.0820880 -0.4230980  0.8279971
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8290333 -0.6897369 -0.7642825 -0.43374243  0.3887547 -0.77819536
## med_low   0.3902633 -0.5443053 -0.4238660 -0.08385135  0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126  0.1295279  0.06516648
## high     -0.8564037  1.6390172  1.5146914  0.78181164 -0.8395683  0.90684062
##                 medv
## low       0.52171440
## med_low  -0.04934231
## med_high  0.14052867
## high     -0.73164690
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12991855  0.717286290 -1.11409479
## indus    0.01318562 -0.275547338  0.12479393
## chas    -0.08510193 -0.049959856  0.17108503
## nox      0.50033989 -0.751262885 -1.27531579
## rm      -0.10015731 -0.108297341 -0.19190232
## age      0.21057791 -0.358696692 -0.14782657
## dis     -0.05526417 -0.347261139  0.38293342
## rad      3.16593191  1.032967549 -0.36643042
## tax     -0.01168751 -0.096843332  1.10072573
## ptratio  0.12524469  0.004459516 -0.41009220
## black   -0.12747580 -0.014860435  0.07573253
## lstat    0.22137355 -0.316638242  0.31488347
## medv     0.17976929 -0.444140572 -0.22238434
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9509 0.0340 0.0151

Biplot

arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = scale * heads[,choices[1]], 
         y1 = scale * heads[,choices[2]], col=color, length = arrow_heads)
  text(scale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot

We see that of the first two linear discriminants, LD1 nearly perfectly separates the data into two clusters: those with a high crime rate, and the rest. rad clearly has the highest coefficient in LD1, which can be seen both from the biplot and from the LDA summary.

LD2 seems to find another axis within the data that explains a smaller effect. Aside from rad, which again has the largest coefficient, the largest coefficients in LD2 belong to nox, zn, and medv.

6. Validation in test set

# Set aside the correct classes and drop them from the test set
facit <- test$crime
test <- dplyr::select(test, -crime)

Let’s predict the crime rate quartiles in the test set and cross tabulate:

# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)

# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
##           predicted
## correct    low med_low med_high high
##   low       15       9        0    0
##   med_low    5      13        8    0
##   med_high   1      10       19    1
##   high       0       0        0   21
nrow(test)
## [1] 102

and here’s the same table visualized as an image:

image(tab)
fig. 6.1 Confusion matrix of the LDA fit

The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102); the next most common case is being off by one class (33/102), and finally off by two classes (1/102).

68/102 ≈ 67% is not perfect, but it is a lot better than choosing at random, which should give a correct prediction 25% of the time. It also looks like the model makes a decent binary predictor of whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right): for that predictor, the correct classification rate would be 101/102, nearly 100%!
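To make that binary reading concrete, here is a small sketch that rebuilds the 4×4 table from the printed output above, collapses it into a high vs. not-high table, and computes its accuracy:

```r
# Rebuild the printed confusion matrix (rows = correct, cols = predicted).
tab <- matrix(c(15,  9,  0,  0,
                 5, 13,  8,  0,
                 1, 10, 19,  1,
                 0,  0,  0, 21),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# Collapse both dimensions to "high" vs. "not_high" and sum the cells.
is_high <- function(lbl) ifelse(lbl == "high", "high", "not_high")
tab_bin <- tapply(tab, list(correct   = is_high(rownames(tab)[row(tab)]),
                            predicted = is_high(colnames(tab)[col(tab)])), sum)
tab_bin
sum(diag(tab_bin)) / sum(tab_bin)  # 101/102, ~0.99
```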

6.1 Extra: let’s try with only rad

lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit

Using only rad gives us a model that seems to have exactly the same predictive power in the high vs not high case, but loses information in the lower quartiles. This fits with our earlier analysis of how LD1 was mostly rad and was able to carve out most of the high crime rate areas from the rest.

7. K-means

The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.

I will take five runs with different seeds and average them, plotting the standard deviation of the twcss for each k as error bars; this should give a more reliable picture than a single run. (Note that this elbow check runs on the unscaled Boston data, while the final clustering below uses the scaled data, which is why the twcss values here are in the millions.)

# determine a good number of clusters k

# calculate the total within sum of squares, take 5 samples to stabilize the variance 
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})

df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))
df$twcss <- rowMeans(select(df,c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
  geom_line() +
  geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
  theme(legend.position="none") +
  scale_x_continuous(breaks=df$k) +
  scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot

It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here: the error bars collapse to essentially nothing, indicating that the twcss is constant across seeds.

k=3 is also a potential choice, although a less clear one to me. After that, however, the variance increases greatly and each additional cluster reduces the twcss only marginally, which indicates that there isn’t a clear structure in the data to guide the clustering.

I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.

Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       28    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k2m$centers
##         crim         zn      indus         chas        nox         rm
## 1 -0.3894158  0.2621323 -0.6146857  0.002908943 -0.5823397  0.2446705
## 2  0.7238295 -0.4872402  1.1425514 -0.005407018  1.0824279 -0.4547830
##          age        dis        rad        tax    ptratio      black      lstat
## 1 -0.4331555  0.4540421 -0.5828749 -0.6291043 -0.2943707  0.3282754 -0.4530491
## 2  0.8051309 -0.8439539  1.0834228  1.1693521  0.5471636 -0.6101842  0.8421083
##         medv
## 1  0.3532917
## 2 -0.6566834

It seems that even this k = 2 clustering has found the high crime rate cluster. Let’s confirm that with a visualization.

ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
fig. 7.2 k-means pairs (k = 2)

It seems that the model has picked two clusters with the following relative positions to each other:

  • the red cluster has a much lower crime rate
  • the red cluster has a lower radial highway access index
  • the red cluster has a lower tax rate
  • the red cluster has a much higher proportion of black residents
  • the red cluster has a lower proportion of buildings built prior to 1940
  • the red cluster has a similar median value distribution, shifted towards a higher valuation (excepting a blue bump right at the top of the valuation range)
  • the blue cluster has a much smaller distance to employment centers
  • the blue cluster has a much smaller proportion of residential land zoned for large plots
  • the red cluster has a much smaller proportion of single room apartments and other small housing units
  • the red cluster has a smaller proportion of working class people and non-educated adults

So it seems we have found a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways, and high density building (lower number of rooms, zoned for smaller plots) and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes and more working class or non-educated people, and a much higher proportion of black residents.

So it seems like the red regions are newer developments, newer suburbs farther away from the city, and there may be some price discrimination in the house prices (medv) connected with the high proportion of black residents living there. You can see effects of segregationist US housing policy in the data: e.g., the 1949 Housing Act set up a framework to subsidize public housing for whites with clauses forbidding resale to black people, which meant that black people paid more for housing (see e.g. Abramovitz & Smith, “The Persistence of Residential Segregation by Race, 1940 to 2010: The Role of Federal Housing Policy”, Families in Society, Vol. 102, Issue 1, for more).

7.2 Let’s try with k=3

Why not try with k=3 for good measure? Maybe we can find additional structure in the data.

set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k3m$centers
##         crim         zn      indus        chas        nox         rm        age
## 1 -0.4076669  1.1526549 -0.9877755 -0.10115080 -0.9634859  0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208  0.07398993 -0.1662087 -0.1700456  0.1673019
## 3  0.8942488 -0.4872402  1.0913679 -0.01330932  1.1109351 -0.4609873  0.7828949
##           dis        rad        tax     ptratio      black       lstat
## 1  1.05650031 -0.5965522 -0.6837494 -0.62478941  0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655  0.2680397 -0.05818052
## 3 -0.84882600  1.3656860  1.3895093  0.63256391 -0.7083974  0.90799414
##          medv
## 1  0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
fig. 7.3 k-means pairs (k = 3)

Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.

The differences between the red and green clusters seem to be:

  • the red cluster has higher values of zn, i.e. more land zoned for large plots
  • the red cluster has a smaller proportion of older buildings
  • the red cluster consists of exclusively black neighbourhoods, while the green cluster has some spread
  • the green cluster sits between the red and blue clusters in the proportion of laborers and uneducated adults

Maybe the red cluster matches better with those black neighbourhoods built more recently, to which the 1949 Housing Act and the Federal Housing Administration regulations applied? I don’t know for sure; more analysis would be required.